Objectives

  • Data Exploration
  • Plot types
    • Bar plot and pie chart
    • Histogram
    • Scatter plot
    • Text plot
    • Boxplot and violin plot
    • Line chart

1. Data Exploration

In this section we’ll continue using the colonrectal cancer dataset. But besides sub group information, we’ll explore other sample information.

CRC <- read.csv("./data/CRC_train.csv")
str(CRC[,73:79])
## 'data.frame':    200 obs. of  7 variables:
##  $ Sample         : Factor w/ 200 levels "P1A1","P1A10",..: 2 4 5 7 11 12 13 16 18 21 ...
##  $ Group          : Factor w/ 2 levels "CRC","Healthy": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age            : int  60 70 65 65 62 55 61 52 89 81 ...
##  $ Gender         : Factor w/ 2 levels "female","male": 1 2 2 1 1 2 2 2 1 2 ...
##  $ Cancer_stage   : int  1 1 1 4 3 2 2 4 4 1 ...
##  $ Tumour_location: Factor w/ 2 levels "colon","rectum": 1 2 2 1 1 1 2 2 1 2 ...
##  $ Sub_group      : Factor w/ 3 levels "Benign","CRC",..: 2 2 2 2 2 2 2 2 2 2 ...
# Inspect the possible values for the annotation columns
unique(CRC[, 'Cancer_stage'])
## [1]  1  4  3  2 NA
unique(CRC[, 'Tumour_location'])
## [1] colon  rectum <NA>  
## Levels: colon rectum
unique(CRC[, 'Sub_group'])
## [1] CRC     Benign  Healthy
## Levels: Benign CRC Healthy

2. Plot types

2.1 Bar plot and pie chart

geom_bar() makes the height of the bar proportional to the number of cases in each group.

library(ggplot2)
# Investigate the number of each sub_group
g1 <- ggplot(CRC, aes(Sub_group)) 
g1 + geom_bar()

# horizontal barplot
g1 + geom_bar() + 
  coord_flip()

We can also calculate the number of patients in each group manually and then make bar plot with stat="identity"

library(dplyr)
# count the number of samples in each sub group
n_subgroup <- CRC %>% group_by(Sub_group) %>% summarise(N = n())
n_subgroup
## # A tibble: 3 × 2
##   Sub_group     N
##      <fctr> <int>
## 1    Benign    34
## 2       CRC   100
## 3   Healthy    66
ggplot(n_subgroup, aes(x=Sub_group, y=N)) + 
  geom_bar(stat="identity")

We want to know the count of different gender patients in each sub group

g2 <- ggplot(CRC, aes(Sub_group, fill=factor(Gender)))
# Stacked barplot puts overlapping bars on top of each other
g2 + geom_bar(position = "stack")

# place overlapping bars side-by-side
g2 + geom_bar(position = "dodge")

#displays proportions
g2 + geom_bar(position = "fill") 

Pie chart is changed from bar chart through coord_polar()

ggplot(data = CRC) + 
  geom_bar(mapping = aes(x = factor(1), fill = factor(Sub_group))) + 
  coord_polar(theta = "y")

2.2 Histogram

We want to Investigate the abundance distribution of one protein. geom_histogram() shows the distribution of a single variable

g3 <- ggplot(CRC, aes(SERPINA3))
g3 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

freqency polygons show the distribution of a single variable with lines

g3 + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Challenge If you want to know the density distribution of protein AFM among all the observations, what plot will you make?

\(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\) \(~\)

g <- ggplot(CRC, aes(AFM))
g + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

g + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.3 Scatter plot

We want to investigate the relation between two proteins. geom_point() is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables.

g <- ggplot(CRC, aes(x = SERPINA3, y = TIMP1))
g + geom_point()

If you have a scatterplot with a lot of noise, it can be hard to see the dominant pattern. In this case it’s useful to add a smoothed line to the plot with geom_smooth()

g + geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

smoothScatter() produces a smoothed color density representation of a scatterplot

smoothScatter(CRC$SERPINA3, CRC$TIMP1)

Challenge If you want to the relationship of abundance between protein AFM and AHSG, what plot will you make?

\(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\) \(~\)

ggplot(CRC, aes(x = AFM, y = AHSG))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess'

2.4 Text plot

geom_text can add text to the plot

g1 <- ggplot(CRC, aes(x = Sample, y = TIMP1))
g1 + geom_text(aes(label = Age))

check_overlap can remove the overlapped texts

g1 + geom_text(aes(label = Age), check_overlap = TRUE)

We can aslo add points to the plot.

g1 + geom_text(aes(label = Age), check_overlap = TRUE) +
  geom_point()

nudge_x and nudge_y can offset the text

g1 + geom_text(aes(label = Age), nudge_y = 0.05, check_overlap = TRUE) +
  geom_point(shape = 1, aes(color = Sub_group))

ggrepel implements functions to repel overlapping text labels away from each other and away from the data points that they label.

library(ggrepel)
g1 + geom_point(shape = 1, aes(color = Sub_group)) +
  geom_text_repel( aes( label = Age),  check_overlap = F, 
             position=position_jitter(width=.1, height=.2)  )
## Warning: Ignoring unknown parameters: check_overlap, position

Adding labels to bar plot

# create a ggplot data
ggplot(n_subgroup, aes(x=Sub_group, y=N)) +
  # draw the bar plot
  geom_bar(stat="identity") +
  # create the number text above the bar in white
  geom_text(aes(label=N), vjust=1.5, colour="white")

2.5 Boxplot and violin plot

We want to investigate how protein abundance varies across different groups. geom_boxplot() summarise the shape of the distribution with a handful of summary statistics.

b <- ggplot(CRC, aes(Sub_group, SERPINA3)) 
b + geom_boxplot()

We can also overlay multiple geoms by simply adding them one after the other.

b + geom_boxplot() + geom_jitter() 

b + geom_jitter() + geom_boxplot()

b + geom_jitter(alpha = 0.5) + geom_boxplot()

We want to consider another factor gender

b + geom_boxplot(aes(fill= factor(Gender)), position = "dodge")

A violin plot is a mirrored density plot displayed in the same way as a boxplot.

b + geom_violin()

2.6 Line chart

We want to know how the protein abundance changes across different samples.

ggplot(CRC[1:100,],aes(Sample, SERPINA3, group = 1)) + 
  geom_point(size = 0.5) +
  geom_line() 

Challenge If you want to compare the abundance distribution of protein AFM among different groups, what plot will make? And you also want to see the difference between male and female in each group, what change will make? How to make the boxplot horizontal direction? Add data points to the boxplot. \(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\)

\(~\) \(~\)

ggplot(CRC,aes(Sub_group, AFM)) + 
  geom_boxplot()

ggplot(CRC,aes(Sub_group, AFM, fill= factor(Gender))) + 
  geom_boxplot(position = "dodge")

ggplot(CRC,aes(Sub_group, AFM, fill= factor(Gender))) + 
geom_boxplot(position = "dodge") +
  coord_flip()

ggplot(CRC,aes(Sub_group, AFM, fill= factor(Gender))) + 
  geom_boxplot(position = "dodge")+
  geom_jitter()